---
title: Reading CSV and Parquet Data from S3 Using S3 Select
---

The PXF S3 connector supports reading certain CSV-format and Parquet-format data from S3 using the Amazon S3 Select service. S3 Select provides direct query-in-place features on data stored in Amazon S3.

When you enable it, PXF uses S3 Select to filter the contents of S3 objects to retrieve the subset of data that you request. This typically reduces both the amount of data transferred to Greenplum Database and the query time.

You can use the PXF S3 Connector with S3 Select to read:

- `gzip`-compressed or `bzip2`-compressed `CSV` files
- `Parquet` files with `gzip`-compressed or `snappy`-compressed columns

The data must be `UTF-8`-encoded, and may be server-side encrypted.

PXF supports column projection as well as predicate pushdown for `AND`, `OR`, and `NOT` operators when using S3 Select.

<div class="note"> Using the Amazon S3 Select service may increase the cost of data access and retrieval. Be sure to consider the associated costs before you enable PXF to use the S3 Select service.</div>

## <a id="s3_select_enable"></a>Enabling PXF to Use S3 Select

The `S3_SELECT` external table custom option governs PXF's use of S3 Select when accessing the S3 object store. You can provide the following values when you set the `S3_SELECT` option:

| S3-SELECT Value  | Description |
|-------|-------------------------------------|
| OFF    | PXF does not use S3 Select; the default. |
| ON    | PXF always uses S3 Select. |
| AUTO    | PXF uses S3 Select when it will benefit access or performance. |

By default, PXF does not use S3 Select (`S3_SELECT=OFF`). You can enable PXF to always use S3 Select, or to use S3 Select only when PXF determines that it could be beneficial for performance. For example, when `S3_SELECT=AUTO`, PXF automatically uses S3 Select when a query on the external table utilizes column projection or predicate pushdown, or when the referenced CSV file has a header row.

**Note**: The <code>IGNORE_MISSING_PATH</code> custom option is not available when you use a PXF external table to read CSV text and Parquet data from S3 using S3 Select.


## <a id="s3_select_parquet"></a>Reading Parquet Data with S3 Select

PXF supports reading Parquet data from S3 as described in [Reading and Writing Parquet Data in an Object Store](objstore_parquet.html). If you want PXF to use S3 Select when reading the Parquet data, you add the `S3_SELECT` custom option and value to the `CREATE EXTERNAL TABLE` `LOCATION` URI.

### <a id="parquet_compress"></a>Specifying the Parquet Column Compression Type

If columns in the Parquet file are `gzip`-compressed or `snappy`-compressed, use the `COMPRESSION_CODEC` custom option in the `LOCATION` URI to identify the compression codec alias. For example:

``` pre
&COMPRESSION_CODEC=gzip
```

Or,

``` pre
&COMPRESSION_CODEC=snappy
```

### <a id="parquet_cet"></a>Creating the External Table

Use the following syntax to create a Greenplum Database external table that references a Parquet file on S3 that you want PXF to access with the S3 Select service:

``` sql
CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
  LOCATION ('pxf://<path-to-file>?PROFILE=s3:parquet&SERVER=<server_name>&S3_SELECT=ON|AUTO[&<other-custom-option>=<value>[...]]')
FORMAT 'CSV';
```

<div class="note">You <i>must</i> specify <code>FORMAT 'CSV'</code> when you enable PXF to use S3 Select on an external table that accesses a Parquet file on S3.</div>

For example, use the following command to have PXF use S3 Select to access a Parquet file on S3 when optimal:

``` sql
CREATE EXTERNAL TABLE parquet_on_s3 ( LIKE table1 )
  LOCATION ('pxf://bucket/file.parquet?PROFILE=s3:parquet&SERVER=s3srvcfg&S3_SELECT=AUTO')
FORMAT 'CSV';
```

## <a id="s3_select_csv"></a>Reading CSV files with S3 Select

PXF supports reading CSV data from S3 as described in [Reading and Writing Text Data in an Object Store](objstore_text.html). If you want PXF to use S3 Select when reading the CSV data, you add the `S3_SELECT` custom option and value to the `CREATE EXTERNAL TABLE` `LOCATION` URI. You may also specify the delimiter formatter option and the file header and compression custom options.

### <a id="csv_header"></a>Handling the CSV File Header

<div class="note">PXF can read a CSV file with a header row <i>only</i> when the S3 Connector uses the Amazon S3 Select service to access the file on S3. PXF does not support reading a CSV file that includes a header row from any other external data store.</div>

CSV files may include a header line. When you enable PXF to use S3 Select to access a CSV-format file, you use the `FILE_HEADER` custom option in the `LOCATION` URI to identify whether or not the CSV file has a header row and, if so, how you want PXF to handle the header. PXF never returns the header row.

**Note**: You *must* specify `S3_SELECT=ON` or `S3_SELECT=AUTO` when the CSV file has a header row. Do not specify `S3_SELECT=OFF` in this case.

The `FILE_HEADER` option takes the following values:

| FILE_HEADER Value  | Description |
|-------|-------------------------------------|
| NONE    | The file has no header row; the default. |
| IGNORE  | The file has a header row; ignore the header. Use when the order of the columns in the external table and the CSV file are the same. (When the column order is the same, the column names and the CSV header names may be different.)|
| USE    | The file has a header row; read the header. Use when the external table column names and the CSV header names are the same, but are in a different order. |

If both the order and the names of the external table columns and the CSV header are the same, you can specify either `FILE_HEADER=IGNORE` or `FILE_HEADER=USE`.

<div class="note">PXF cannot match the CSV data with the external table definition when both the order and the names of the external table columns are different from the CSV header columns. Any query on an external table with these conditions fails with the error <code>Some headers in the query are missing from the file</code>. </div>

For example, if the order of the columns in the CSV file header and the external table are the same, add the following to the `CREATE EXTERNAL TABLE` `LOCATION` URI to have PXF ignore the CSV header:

``` pre
&FILE_HEADER=IGNORE
```

### <a id="csv_compress"></a>Specifying the CSV File Compression Type

If the CSV file is `gzip`- or `bzip2`-compressed, use the `COMPRESSION_CODEC` custom option in the `LOCATION` URI to identify the compression codec alias. For example:

``` pre
&COMPRESSION_CODEC=gzip
```

Or,

``` pre
&COMPRESSION_CODEC=bzip2
```

### <a id="csv_cet"></a>Creating the External Table

Use the following syntax to create a Greenplum Database external table that references a CSV file on S3 that you want PXF to access with the S3 Select service:

``` sql
CREATE EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-file>
    ?PROFILE=s3:text&SERVER=<server_name>&S3_SELECT=ON|AUTO[&FILE_HEADER=IGNORE|USE][&COMPRESSION_CODEC=gzip|bzip2][&<other-custom-option>=<value>[...]]')
FORMAT 'CSV' [(delimiter '<delim_char>')];
```

**Note**: Do not use the `(HEADER)` formatter option in the `CREATE EXTERNAL TABLE` command.

**Note**: PXF does not support the `SKIP_HEADER_COUNT` custom option when you read a CSV file on S3 using the S3 Select service.

For example, use the following command to have PXF always use S3 Select to access a `gzip`-compressed file on S3, where the field delimiter is a pipe ('|') character and the external table and CSV header columns are in the same order.

``` sql
CREATE EXTERNAL TABLE gzippedcsv_on_s3 ( LIKE table2 )
  LOCATION ('pxf://bucket/file.csv.gz?PROFILE=s3:text&SERVER=s3srvcfg&S3_SELECT=ON&FILE_HEADER=USE')
FORMAT 'CSV' (delimiter '|');
```
